In-class Exercise 5: Geographically Weighted Logistic Regression (GWLR) and Application

Overview

In this in-class exercise, I utilized the following Geographical Weighted Logistic Regression (GWLR) algorithm:

  • Weighting functions (kernel)

  • Weighting schemes

  • Bandwidth

Getting Started

Setting the Scene

  • To build an explanatory model to discover factor affecting water point status in Osun State, Nigeria (one of the LGAs with the highest percentage of non-functional waterpoints, over 40% which is more than the national average)

  • Study area: Orun State, Nigeria

  • Data sets:

    • Osun.rds contains LGAs boundaries of sun State. It is in sf polygon data frame, and

    • Osun_ wp_sf.rds contained water points within Osun State. It is in sf point data frame.

Model Variables

  • Dependent variable: Water point status (i.e. functional/non-functional)

  • Independent variables:

    • distance_to_primary_road,

    • distance_to_secondary_road,

    • distance_to_tertiary_road,

    • distance_to_city,

    • distance_to_town,

    • water_point_population,

    • local_population_1km,

    • usage_capacity,

    • is_urban,

    • water source clean

Note: All variables are continuous, except for the last 3 variables (categorical).

Setting the Analytical Tools

The code chunk below installs and loads the following R packages:

  • sf

  • tidyverse

  • funModeling

  • corrplot

  • ggpubr

  • sf

  • spdep

  • GWmodel

  • tmap

New packages:

  • caret (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, variable importance estimation etc.
  • skimr provides summary statistics about variables in data frames, tibbles, data tables and factors. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for factors and data frames and fivenum() for numeric factors.
  • blorr offers tools for building and validating binary logistic regression models.
pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep, GWmodel, tmap, skimr, caret)

Importing the Analytical data

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

Exploratory Data Analysis (EDA)

The code chunk below uses freq() to see the breakdown of the ‘status’ column by functional and non-functional waterpoints, represented by ‘True’ and ‘False’ respectively.

Osun_wp_sf %>%
  freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

Importing Nigeria LGA level boundary

The code chunk below plots the status of the waterpoints on the Nigeria LGA level map:

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun)+
#  tmap_options(check.and.fix = TRUE)+
  tm_polygons(alpha=0.4)+
tm_shape(Osun_wp_sf)+
  tm_dots(col="status",
          alpha = 0.6)+
  tm_view(set.zoom.limits = c(9,12))

Observation: From the map above, we can see that both functional and non-functional waterpoints are quite evenly scattered across LGAs in Nigeria. To understand the logical relationship between status of waterpoints and the LGAs, we can do further analysis on different data variables available.

Summary Statistics using skimr

The code chunk below generates summary statistics with skim() of the skimr package:

Osun_wp_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

Observation: From the results above, we can see that install_year and fecal_coliform_value have excessive number of missing values, thus should not be used in our analysis to prevent data loss.

  • Note to think of extent of data loss: number of loss in comparison to total number of data points

The code chunk below is to exclude all missing values + change data type into factor:

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

Note: usage_capacity is no longer numeric (assume to be continuous) in nature and has become a factor. Osun_wp_sf_clean dataset has 4 less records than Osun_wp_sf.

Correlation Analysis

The code chunk below drops irrelevant data columns:

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL)
# OR st_drop_geometry()

The code chunk below takes all numerical fields to do correlation:

cluster_vars.cor = cor(
  Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Observation: From the results above, there are no variables that are highly correlated (>=0.8). Thus, there is no sign of multi-collinearity between these variables.

Building a Logistic Regression Model

glm of R is used to calibrate a logistic regression for the water point status.

model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city + 
               distance_to_town +
               is_urban +
               usage_capacity + 
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = "logit"))

Instead of using a typical R report, blr_regress() of blorr package is used.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

From the report above, we can exclude the distance_to_primary_road and distance_to_secondary_road as their p-value > 0.05, so we cannot reject null hypothesis and the test result is not statistically significant.

Model Assessment of Non-Geographically Weighted Logistic Regression Model

The validity of a cut-off (e.g. 0.5) is measured using sensitivity, specificity and accuracy.

  • Sensitivity: The % of correctly classified events out of all events = TP / (TP + FN)

  • Specificity: The % of correctly classified non-events out of all non-events = TN / (TN + FP)

  • Accuracy: The % of correctly classified events out of all events = (TP + TN) / (TP + FP + TN + FN)

Using the code chunk below, we can get the confusion matrix:

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

Observation: From the results, we can see that the model gives us an accuracy of 67.4%, which is better than guessing (50%).

The sensitivity and specificity are 72.1% and 61.5% respectively, which shows that the true positives are slightly higher than the true negative prediction rates.

Building Geographically Weighted Logistic Regression

Converting from sf to sp data frame

The code chunk below converts all variables laid out in the Model Variables section from sf dataframe to sp dataframe:

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           usage_capacity,
           is_urban,
           water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,           1000,        0,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,            300,        1,   Protected Spring 

Building fixed bandwidth GWR model

The code chunk below computes the distance matrix using ggwr() of the sp package:

bw.fixed <- bw.ggwr(status ~
distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city + 
               distance_to_town +
               is_urban +
               usage_capacity + 
               water_source_clean +
               water_point_population +
               local_population_1km,
              data = Osun_wp_sp,
              family = "binomial",
              approach = "AIC",
              kernel = "gaussian",
              adaptive = FALSE,
              longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
=========================
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
=========================
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
=========================
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
=========================
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
=========================
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
=========================
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
=========================
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
=========================
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 
bw.fixed
[1] 2599.672

We need to remove 2 statistically insignificant data columns (p-value > 0.05): distance_to_primary_road and distance_to_secondary_road

gwlr.fixed <- ggwr.basic(status ~
                           distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           is_urban +
                           usage_capacity +
                           water_source_clean +
                           water_point_population +
                           local_population_1km,
                         data = Osun_wp_sp,
                         bw = bw.fixed,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -2035 
       1        -1773 
       2        -1636 
       3        -1562 
       4        -1531 
       5        -1531 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-18 01:13:22 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + is_urban + usage_capacity + water_source_clean + 
    water_point_population + local_population_1km, data = Osun_wp_sp, 
    bw = bw.fixed, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -2.7666e+02 -3.9783e+00  2.9182e+00
   distance_to_tertiary_road                -1.9928e-02 -3.5892e-04  8.9443e-05
   distance_to_city                         -3.0660e-02 -5.6112e-04 -1.0323e-04
   distance_to_town                         -3.4651e-03 -4.3065e-04 -1.2343e-04
   is_urbanTRUE                             -2.9760e+02 -3.1713e+00 -1.4841e+00
   usage_capacity1000                       -4.5108e+01 -1.0254e+00 -3.8922e-01
   water_source_cleanProtected.Shallow.Well -1.0341e+02 -4.2500e-01  5.9754e-01
   water_source_cleanProtected.Spring       -7.8506e+02 -5.4098e+00  2.5517e+00
   water_point_population                   -3.5392e-02 -2.0844e-03 -1.1257e-03
   local_population_1km                     -5.7871e-02  4.0301e-04  9.9909e-04
                                                3rd Qu.      Max.
   Intercept                                 1.0630e+01 1090.6880
   distance_to_tertiary_road                 5.3807e-04    0.0139
   distance_to_city                          1.2644e-04    0.0128
   distance_to_town                          2.2117e-04    0.0160
   is_urbanTRUE                              8.9563e-01  737.2336
   usage_capacity1000                        3.5141e-01    5.8909
   water_source_cleanProtected.Shallow.Well  1.8017e+00   52.2295
   water_source_cleanProtected.Spring        6.5109e+00  151.6551
   water_point_population                    1.9314e-04    0.0567
   local_population_1km                      1.6812e-03    0.0293
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 3053.711 
   AIC : 4500.019 
   AICc : 4759.787 
   Pseudo R-square value:  0.532677 

   ***********************************************************************
   Program stops at: 2022-12-18 01:13:55 

From the results above, AIC reduced from 5708.9 (non-geographically weighted regression model) to 4500.0 (geographically weighted regression model), thus there was an improvement in the regression model.

Model Assessment of Geographically Weighted Logistic Regression Model

Converting SDF into sf data.frame

To assess the performance of gwlr, firslty, we will convert the SDF object in as dataframe by using the code chunk below:

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will be labelling yhat values greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5,T,F))

Model Assessment

Originally, y data column is logi data while yhat is numerical data. Thus, we need to use as.factor() to convert both columns to factors for comparison.

gwr.fixed$y <-as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1792  302
     TRUE    322 2340
                                          
               Accuracy : 0.8688          
                 95% CI : (0.8589, 0.8783)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7341          
                                          
 Mcnemar's Test P-Value : 0.4469          
                                          
            Sensitivity : 0.8477          
            Specificity : 0.8857          
         Pos Pred Value : 0.8558          
         Neg Pred Value : 0.8790          
             Prevalence : 0.4445          
         Detection Rate : 0.3768          
   Detection Prevalence : 0.4403          
      Balanced Accuracy : 0.8667          
                                          
       'Positive' Class : FALSE           
                                          

Observation: From the results above, we can see improvement in the accuracy (from 67.4% to 86.9%), as a result of improvements in both sensitivity (from 72.1% to 84.8%) and specificity (from 61.5% to 88.6%).

Thus, geographically weighted is a better model, given the improvement in accuracy and increase in true negative (i.e. specificity) which allows for better explanation of non-functional waterpoint. To better manage the waterpoints, we need to look a local strategy for specific LGAs (in this case, we are looking at Osun).

Visualising gwLR

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

Visualising coefficient estimates

The code chunks below is used to create an interactive point symbol map.

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun)+
  tm_polygons(alpha = 0.1)+

tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1)+
  tm_view(set.zoom.limits = c(8,14))

prob_T
tertiary_TV <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_tertiary_road_TV",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8,14))
tertiary_TV
Variable(s) "distance_to_tertiary_road_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.